Bioinformatics (Thomas Dandekar, Meik Kunz)

6.1

Fast Search: BLAST as an Example for a Heuristic Search

The following accelerations are used for sequence comparisons (technical term: sequence

alignment):

A so-called indexing first considers whether the database entry contains single short

words (3 letters for protein sequences or 11 nucleotides for nucleic acid sequences) that

are similar to the sequence of the molecule. If this is the case (a first “hit” or “hit” is

found), the system immediately searches whether there is another hit not too far away

(predefined window length). Only when this second hit is found does the BLAST algo

rithm start checking whether the remaining sequence letters of this database entry match

the search sequence. This exact comparison of the two letter sequences (“alignment”) is

also accelerated by “dynamic programming”, so that step by step more memory is avail

able for the comparison of search sequence and database entry.

Thus, we see two principles of bioinformatics: Since all important biomolecules (DNA,

RNA, proteins, but also, for example, carbohydrates and lipids) are built from recurring

building units, most biomolecules can be recognized by the sequence of these building

units, i.e. by their letter sequence (with each class of molecules using its own alphabet).

In the meantime, however, so much information about biomolecules has been stored in

large databases that a major part of the informatics work in bioinformatics consists of

using fast computational rules (algorithms) and conveniently constructed databases to

cope with this flood of information so well that the correct biomolecule can be identified

as quickly as possible (see Mount et al., 2004).

If you use BLAST on the NCBI website (https://blast.ncbi.nlm.nih.gov/Blast.cgi), for

example, I get a result very quickly (a few seconds, sometimes one to 2 minutes). In that

time, BLAST actually screens several billion nucleotides and many millions of sequence

entries. This is an amazing speed-up. We now want to understand how to speed up bioin

formatics searches in general so that you get a result quickly. This usually happens by

foregoing the perfect search and taking a program that uses shortcuts to get a near-perfect

solution.

When searching for a similar sequence, one way to do an exact search would be to

compare letter by letter to determine exactly where a local match with high similarity is.

Local similarity is therefore a popular choice for protein function searches, because you

can then move on from a subsequence whose similarity was found in the database to the

next best similarity. After I have recognized that a partial sequence, usually a protein

domain, has a certain function, I shorten my protein by this domain and now search for a

hit in the database with the remaining sequence, which then not so rarely assigns the next

piece of the sequence, often a whole domain again, with a suspected function, and so on.

6.1

2013

6 Extremely Fast Sequence Comparisons Identify All the Molecules That Are Present…